Abstract

This paper is concerned with adaptive nonparametric estimation using the 
Goldenshluger-Lepski selection method. This estimator selection method is 
based on pairwise comparisons between estimators with respect to some loss 
function. The method also involves a penalty term that typically needs to 
be large enough in order that the method works (in the sense that one can 
prove some oracle type inequality for the selected estimator). In the case 
of density estimation with kernel estimators and a quadratic loss, we show 
that the procedure fails if the penalty term is chosen smaller than some 
critical value for the penalty: the minimal penalty. More precisely we show 
that the quadratic risk of the selected estimator explodes when the penalty 
is below this critical value while it stays under control when the penalty 
is above this critical value. This kind of phase transition phenomenon for 
penalty calibration has already been observed and proved for penalized model 
selection methods in various contexts but appears here for the first time for 
the Goldenshluger-Lepski pairwise comparison method. Some simulations 
illustrate the theoretical results and lead to some hints on how to use the 
theory to calibrate the method in practice. 

Keywords: Nonparametric statistics, Adaptive estimation, Minimal 
penalty 
1. Introduction


Adaptive estimation is a challenging task in nonparametric estimation.
Many methods have been proposed and studied in the literature. Most of
them rely on some data-driven selection of an estimator among a given
collection. Wavelet thresholding (Donoho et al., 1996), Lepski's method
(Lepskii, 1990), and model selection (Barron, Birgé, and Massart, 1999)
(see also Birgé (2001) for the link between model selection and Lepski's
method) belong to this category. Designing proper estimator selection is
an issue by itself. From a constructive point of view, it is a crucial
step towards adaptive estimation.

For instance, selecting a bandwidth for kernel estimators in density
estimation means that you are able to estimate the density without
specifying its degree of smoothness in advance. Recently an interesting
new estimator selection procedure has been introduced by Goldenshluger and
Lepski (2008). Assume that one wants to estimate some unknown function $f$
belonging to some function space endowed with some norm $\|\cdot\|$. Assume
also that we have at our disposal some collection of estimators
$(\hat f_h)_{h\in\mathcal H}$ indexed by some parameter $h$, the issue being
to select some estimator $\hat f_{\hat h}$ among this collection. The
Goldenshluger-Lepski method proposes to select $\hat h$ as a minimizer
of $B(h) + V(h)$ with

\[ B(h) = \sup\left\{ \left[ \|\hat f_{h'} - \hat f_{h,h'}\|^2 - V(h') \right]_+ ,\ h' \in \mathcal H \right\} \]

where $x_+$ denotes the positive part $\max(x, 0)$ and where $\hat f_{h,h'}$
are auxiliary (typically oversmoothed) estimators and $V(h)$ is a penalty
term (called "majorant" by Goldenshluger and Lepski) to be suitably chosen.
They first develop their methodology in the white noise framework
(Goldenshluger and Lepski, 2008, 2009), next for density estimation
(Goldenshluger and Lepski, 2011) and then for various other frameworks
(Goldenshluger and Lepski, 2013). Their initial motivation was to provide
adaptive procedures for multivariate and anisotropic estimation and they
used the versatility of their method to prove that the selected estimators
can achieve minimax rates of convergence over some very general classes of
smooth functions (see Goldenshluger and Lepski, 2014). To this purpose,
they have established oracle inequalities to ensure that, if $V(h)$ is well
chosen, the final estimator $\hat f_{\hat h}$ is almost as efficient as the
best one in the collection. The Goldenshluger-Lepski methodology has
already been fruitfully applied in various contexts: transport-fragmentation
equations (Doumic et al., 2012), anisotropic deconvolution (Comte and
Lacour, 2013), warped bases regression (Chagny, 2013) among others (see also
Bertin et al. (2015), which contains some explanation on the methodology).
We cannot close this paragraph without citing the nice work of Laurent et
al. (2008), who have independently introduced a very similar method, in
order to adapt the model selection point of view to pointwise estimation.

In this paper we focus on the issue of calibrating the penalty term $V$. As
we mentioned above, the "positive" known results are of the following kind:
the method performs well (at least from a theoretical viewpoint) when $V$ is
well chosen. More precisely, one is able to prove oracle inequalities only
if $V$ is not too small. But the issue is now: what is the minimal (or the
optimal) value for $V$ to preserve (or optimize) the performance of the
method? Here we consider this issue from a theoretical point of view, but it
is actually a crucial issue for a practical implementation of the method. In
this paper we focus on the (simple) classical bandwidth selection issue for
kernel estimators in the framework of univariate density estimation. The
main contribution of this paper is to highlight a phase transition
phenomenon that can be roughly described as follows. For some critical
quantity $V_0$ (that we call "minimal penalty"), if the penalty term $V$ is
defined as $V = aV_0$, then either $a < 1$ and the risk
$\mathbb E\|f - \hat f_{\hat h}\|^2$ is proven to be dramatically
suboptimal, or $a > 1$ and the risk remains under control. This kind of
phase transition phenomenon and its possible use for penalty calibration
appeared for the first time in Birgé and Massart (2007) in the context of
Gaussian penalized model selection. It is interesting to see that the same
phenomenon occurs for a pairwise comparison based selection method such as
the Goldenshluger-Lepski method.

Proofs are extensively based on concentration inequalities. In particular,
left tail concentration inequalities are used to prove the explosion result
below the critical value for the penalty. Although the probabilistic tools
are non asymptotic by essence, they merely allow us to justify that suprema
of empirical processes are well concentrated around their expectations, and
the approximations that we make on those expectations are indeed asymptotic.
This means that our final results are (unfortunately) somewhat asymptotic in
nature, at least as far as the identification of the critical value $a = 1$
is concerned. To be more concrete, we mean that for a given unknown density
and a given sample size $n$, it is unclear that a phase transition
phenomenon (if any) should occur at the critical value $a = 1$ as predicted
by the (asymptotic) theory. But still, because of the concentration
phenomenon, one can hope that some phase transition does occur (even non
asymptotically) at some critical value even though it is not equal (or even
close) to the (asymptotic) value $a = 1$. To check this, we have also
implemented numerical simulations. These simulations allow us to understand
what should be retained from the theory as a typical behavior of the method.
In fact the simulations confirm the above scenario. It turns out that the
phase transition does occur when you run simulations even though the
critical point is not located at $a = 1$. This is actually what should be
retained from the theory (at least from our point of view). The fact that
some phase transition does occur is good news for the calibration issue
because this means that in practice you can detect the critical value from
the data (forgetting about the asymptotic value $a = 1$). Then you can hope
to use this value to elaborate some fully data-driven and non asymptotic
calibration of the method. We conclude the paper by providing some hints on
how to perform that explicitly.

In Section 2 we specify the statistical framework and we recall the oracle
inequality that can be obtained in the framework of density estimation. Then
Section 3 contains our main theorem about minimal penalty. This result is
illustrated by some simulations (Section 4). Finally, some proofs are
gathered in Section 6 after some concluding remarks.


2. Kernel density estimation framework and upper bound on the 
risk 


We consider independent and identically distributed real variables
$X_1, \dots, X_n$ with unknown density $f$ with respect to the Lebesgue
measure on the real line. Let $\|\cdot\|$ denote the $L^2$ norm with respect
to the Lebesgue measure. For each positive number $h$ (the bandwidth) we can
define the classical kernel density estimator

\[ \hat f_h(x) = \frac1n \sum_{i=1}^n K_h(x - X_i) \tag{1} \]

where $K$ is a kernel and $K_h = K(\cdot/h)/h$. We assume here that the
function to be estimated is univariate and we study the
Goldenshluger-Lepski methodology without oversmoothing. This means that we
do not use auxiliary estimators. We could actually prove the same results
for the original method but the proofs are more involved and we decided to
keep the proofs as simple as possible, trying not to hide the heart of the
matter.

To be more precise the procedure that we study is the following one:
starting from some (finite) collection of estimators
$\{\hat f_h,\ h \in \mathcal H\}$, we set

\[ B(h) = \sup_{h' \le h} \left[ \|\hat f_{h'} - \hat f_h\|^2 - V(h') \right]_+ \quad \text{with} \quad V(h') = a\,\frac{\|K_{h'}\|^2}{n} \tag{2} \]

with $a$ being the tuning parameter of interest. Then the selected bandwidth
is defined by

\[ \hat h = \arg\min_{h \in \mathcal H} \{ B(h) + V(h) \}. \tag{3} \]

It is worth noticing that the penalty term $V(h)$ which is used here is exactly
proportional to the integrated variance of the corresponding estimator.
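The procedure (2)-(3) can be sketched in a few lines. The following is a minimal illustration, assuming the Gaussian kernel, for which the inner products between kernel estimators are available in closed form; the function names `gauss_pdf` and `gl_select` are hypothetical and are not taken from the paper.

```python
import numpy as np

def gauss_pdf(d, s2):
    """Density of N(0, s2) evaluated at d (s2 is a variance)."""
    return np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

def gl_select(x, bandwidths, a):
    """Select a bandwidth by (2)-(3), Gaussian kernel only (assumption).

    Returns the selected bandwidth and the criterion B(h) + V(h) on the
    sorted bandwidth grid.
    """
    x = np.asarray(x, dtype=float)
    hs = np.sort(np.asarray(bandwidths, dtype=float))
    n, m = x.size, hs.size
    d = x[:, None] - x[None, :]  # pairwise differences X_i - X_j
    # G[j, k] = <f_hat_{h_j}, f_hat_{h_k}>: the integral of
    # K_s(u - X_i) K_t(u - X_j) du is the N(0, s^2 + t^2) density at X_i - X_j.
    G = np.empty((m, m))
    for j in range(m):
        for k in range(j, m):
            G[j, k] = G[k, j] = gauss_pdf(d, hs[j]**2 + hs[k]**2).mean()
    # V(h) = a ||K_h||^2 / n, with ||K_h||^2 = 1 / (2 sqrt(pi) h) here.
    V = a / (2.0 * np.sqrt(np.pi) * hs * n)
    crit = np.empty(m)
    idx = np.arange(m)
    for k in range(m):
        # ||f_hat_{h'} - f_hat_h||^2 for every h' <= h = hs[k]
        dist2 = G[idx[:k + 1], idx[:k + 1]] + G[k, k] - 2.0 * G[:k + 1, k]
        B = np.max(np.maximum(dist2 - V[:k + 1], 0.0))
        crit[k] = B + V[k]
    return hs[np.argmin(crit)], crit
```

The cost is quadratic in the sample size, so this sketch is only meant for moderate $n$. With $a = 0$ the criterion reduces to $B(h)$, which vanishes at the smallest bandwidth, so the smallest bandwidth is selected; this is the degenerate end of the behavior studied in Section 3.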

We introduce the following notation:

\[ f_h := \mathbb E(\hat f_h), \qquad h_{\min} := \min \mathcal H, \qquad h_{\max} := \max \mathcal H, \]

\[ D(h) := \max\left( \sup_{h' \le h} \|f_{h'} - f_h\|,\ \|f - f_h\| \right) \le 2 \sup_{h' \le h} \|f_{h'} - f\|. \]

We assume that the kernel verifies the following assumption:

(K0) $\int |K| = 1$, $\|K\| < \infty$ and

\[ \forall\, 0 < x \le 1, \qquad \frac{\langle K, K(x\,\cdot)\rangle}{\|K\|^2} \ge 1. \]

Assumption (K0) is satisfied whenever the kernel $K$ is nonnegative and
unimodal with a mode at 0. Indeed in this case $K(xu) \ge K(u)$ for all
$u \in \mathbb R$ and $x \in [0,1]$. This is verified for classical kernels
(Gaussian kernel, rectangular kernel, Epanechnikov kernel, biweight kernel;
see Lemma 4). This entails that for all $h' \le h$,
$\|K_{h'} - K_h\|^2 \le \|K_{h'}\|^2 - \|K_h\|^2$. This Pythagoras-type
inequality is one of the key properties that we shall use for proving our
results.
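As a quick sanity check of the Pythagoras-type inequality, one can evaluate both sides in closed form for the Gaussian kernel. The sketch below assumes the standard identities $\|K_h\|^2 = 1/(2\sqrt{\pi}h)$ and $\langle K_{h'}, K_h\rangle = 1/\sqrt{2\pi(h^2 + h'^2)}$ for that kernel; the helper names are illustrative only.

```python
import math

def sq_norm(h):
    """||K_h||^2 for the Gaussian kernel."""
    return 1.0 / (2.0 * math.sqrt(math.pi) * h)

def sq_dist(h1, h2):
    """||K_{h1} - K_{h2}||^2 for the Gaussian kernel (closed form)."""
    inner = 1.0 / math.sqrt(2.0 * math.pi * (h1**2 + h2**2))
    return sq_norm(h1) + sq_norm(h2) - 2.0 * inner

def pythagoras_gap(h_small, h_large):
    """(||K_{h'}||^2 - ||K_h||^2) - ||K_{h'} - K_h||^2 for h' = h_small <= h = h_large.

    Under (K0) this gap is nonnegative.
    """
    return sq_norm(h_small) - sq_norm(h_large) - sq_dist(h_small, h_large)
```

For instance, `pythagoras_gap(0.1, 0.5)` is positive, and the gap vanishes (up to rounding) when the two bandwidths coincide.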

Let us now recall the positive results that can be obtained for the selection 
method if a is well chosen. 

Proposition 1. Assume that $f$ is bounded and $K$ verifies (K0). Let
$\hat f_{\hat h}$ be the selected estimator defined by (1), (2), (3). Assume
that the parameter $a$ in the penalty $V$ satisfies $a > 1$.

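• There exist some positive constants $C_0 > 0$ and $c > 0$ such that, with
probability larger than

\[ 1 - 2 \sum_{h \in \mathcal H} \sum_{h' \le h} \max\left( e^{-c\sqrt n},\ e^{-c/h'} \right), \]

the following holds:

\[ \|\hat f_{\hat h} - f\| \le C_0 \inf_{h \in \mathcal H} \left\{ D(h) + \sqrt a\, \frac{\|K_h\|}{\sqrt n} \right\}. \]

The values $C_0 = 1 + \sqrt{2(1 + (a^{1/3} - 1)^{-1})}$ and
$c = (a^{1/3} - 1)^2 \min\left( \frac{1}{24}, \frac{\|K\|^2}{6\|f\|_\infty} \right)$
are suitable.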
• Moreover, if $n^{-1} \le h_{\min} \le h_{\max} \le \log^{-2}(n)$, there
exists a positive constant $C$ depending only on $\|K\|$ and $\|f\|_\infty$
such that

\[ \mathbb E\|\hat f_{\hat h} - f\|^2 \le 2 C_0^2 \inf_{h \in \mathcal H} \left\{ D^2(h) + a\, \frac{\|K_h\|^2}{n} \right\} + C\, n\, |\mathcal H|^2\, e^{-\frac{(a^{1/3} - 1)^2}{C} \log^2(n)} \]

($C = \max(24,\ 6\|f\|_\infty/\|K\|^2,\ 4\|f\|_\infty + 4\|K\|^2)$ works).

We recognize in the right-hand side of the oracle type inequalities above
the classical bias variance tradeoff. This oracle inequality shows that the
Goldenshluger-Lepski methodology works when $a > 1$, at least for $n$ larger
than some integer depending on $a$ and the true density. From a non
asymptotic perspective this "positive result" should be understood with
caution: it is clear from the analysis of the behavior of the constants
involved with respect to $a$ that these constants are worse when $a$ is
close to 1.

The proof of Proposition 1 is postponed to Section 6.1. It is based on
the following concentration result (adapted from Klein and Rio (2005)) and
more precisely on inequality (4) below.

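Lemma 2. Let $X_1, \dots, X_n$ be a sequence of i.i.d. variables and
$\nu(t) = n^{-1} \sum_{i=1}^n [g_t(X_i) - \mathbb E(g_t(X_i))]$ for $t$
belonging to a countable set of functions $\mathcal F$. Assume that for all
$t \in \mathcal F$, $\|g_t\|_\infty \le b$ and $\mathrm{Var}(g_t(X_1)) \le v$.
Denote $H = \mathbb E(\sup_{t \in \mathcal F} \nu(t))$. Then, for any
$\varepsilon > 0$, for $H' \ge H$,

\[ \mathbb P\left( \sup_{t \in \mathcal F} \nu(t) \ge (1 + \varepsilon) H' \right) \le \max\left( \exp\left( -\frac{\varepsilon^2 n H'^2}{6v} \right),\ \exp\left( -\frac{\min(\varepsilon, 1)\, \varepsilon\, n H'}{24\, b} \right) \right), \tag{4} \]

\[ \mathbb P\left( \sup_{t \in \mathcal F} \nu(t) \le H - \varepsilon H' \right) \le \max\left( \exp\left( -\frac{\varepsilon^2 n H'^2}{6v} \right),\ \exp\left( -\frac{\min(\varepsilon, 1)\, \varepsilon\, n H'}{24\, b} \right) \right). \tag{5} \]

Moreover

\[ \mathrm{Var}\left( \sup_{t \in \mathcal F} \nu(t) \right) \le \frac vn + 4\, \frac{bH}{n}. \tag{6} \]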

3. Minimal penalty 

In this section, we are interested in finding a minimal penalty $V(h)$,
below which the procedure fails. Indeed, if $a$, and hence $V(h)$, is too
small, minimizing the criterion amounts to minimizing the bias, and thus to
choosing the smallest possible bandwidth. This leads to the worst estimator
and the risk explodes.

In the following result $h_{\min}$ denotes the smallest bandwidth in
$\mathcal H$ and is of order $1/n$.
Theorem 3. Assume that $f$ is bounded. Choose
$\mathcal H = \{e^{-k},\ \lceil 2 \log\log n \rceil \le k \le \lfloor \log n \rfloor\}$
as a set of bandwidths. Consider for $K$ the Gaussian kernel, the
rectangular kernel, the Epanechnikov kernel or the biweight kernel. If
$a < 1$, then there exists $C > 0$ (depending on $f$, $a$, $K$) such that,
for $n$ large enough (depending on $f$ and $K$), the selected bandwidth
$\hat h$ defined by (2) and (3) satisfies

\[ \mathbb P(\hat h \ge 3 h_{\min}) \le C (\log n)^2 \exp(-(\log n)^2 / C), \]

i.e. $\hat h < 3 h_{\min}$ with high probability. Moreover

\[ \liminf_{n \to \infty} \mathbb E\|f - \hat f_{\hat h}\|^2 > 0. \]


This theorem is proved in Section 6.2 for more general kernels and bandwidth
sets. Here we have simplified the conditions on $\mathcal H$ for the sake of
readability. Actually the real condition on $\mathcal H$ for Theorem 3 is
that $E_{\mathcal H} = \min\{h/h' ;\ h \in \mathcal H,\ h' \in \mathcal H,\ h > h'\}$
does not depend on $n$ and is larger than 1. It can be verified for the
highlighted set $\mathcal H = \{e^{-k},\ a_n \le k \le b_n\}$, but for
$\mathcal H = \{c_n + d_n k,\ a_n \le k \le b_n\}$ as well.
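To make the bandwidth set of Theorem 3 concrete, the short sketch below builds the geometric grid and computes the minimal ratio between distinct bandwidths. The function names are illustrative assumptions, not notation from the paper.

```python
import math

def theorem3_grid(n):
    """Bandwidths e^{-k} for ceil(2 log log n) <= k <= floor(log n)."""
    k_min = math.ceil(2.0 * math.log(math.log(n)))
    k_max = math.floor(math.log(n))
    return [math.exp(-k) for k in range(k_min, k_max + 1)]

def min_ratio(bandwidths):
    """min{h / h' : h > h'} over the grid.

    For a sorted grid the minimum is attained by two consecutive values.
    """
    hs = sorted(bandwidths)
    return min(big / small for small, big in zip(hs, hs[1:]))
```

For $n = 5000$ the grid contains the four bandwidths $e^{-5}, \dots, e^{-8}$, the minimal ratio equals $e$ whatever $n$, and $n h_{\min}$ stays between 1 and $e$, consistent with $h_{\min}$ being of order $1/n$.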

Mathematically, the proof of this result relies on two main arguments.
The first argument is probabilistic: roughly speaking, concentration
inequalities allow us to deal with expectations of the pairwise squared
distances between estimators instead of the squared distances themselves.
The other argument is analytical: it essentially relies on proper
substitutes to Pythagoras' formula for kernel smoothing. The phase
transition phenomenon is actually easier to highlight in a context for which
we have the actual Pythagoras' identity at our disposal; see the discussion
on projection estimators for the Gaussian white noise model in Lacour and
Massart (2015).

Theorem 3 ensures that the critical value for the parameter $a$ is 1. Below
this value, the selected bandwidth $\hat h$ is of order $1/n$, which is very
small (remember that for the minimax study of a density with regularity
$\alpha$, the optimal bandwidth is $n^{-1/(2\alpha + 1)}$), so the risk
cannot tend to 0.


4. Simulations 

In this section, we illustrate the role of the tuning parameter $a$, the
constant in the penalty term $V$. The aim is to observe the evolution of the
risk for various values of $a$. Is the critical value $a = 1$ observable in
practice? To do this, we simulate data $X_1, \dots, X_n$ for several
densities $f$. Next, for a grid of values for $a$, we compute the selected
bandwidth $\hat h$, the estimator $\hat f_{\hat h}$ and the integrated loss
$\|\hat f_{\hat h} - f\|^2$.


Figure 1: Plots of true density $f$ for Examples 1-6 (panels include Cauchy, Uniform, Expo, Mix2Gauss)

We consider the following examples, see Figure 1:
Example 1: $f$ is the Cauchy density
Example 2: $f$ is the uniform density $\mathcal U(0,1)$
Example 3: $f$ is the exponential density $\mathcal E(1)$
Example 4: $f$ is a mixture of two normal densities $\frac12 \mathcal N(0,1) + \frac12 \mathcal N(3,9)$
Example 5: $f$ is a mixture of normal densities sometimes called Claw
Example 6: $f$ is a mixture of eight uniform densities


We implement the method for various kernels, but we only present results
for the Gaussian kernel, since the choice of kernel does not modify the
results. On the other hand, the method is sensitive to the choice of the
bandwidth set $\mathcal H$: here we use

\[ \mathcal H = \{e^{-k},\ 3 \le k \le 10\} \cup \{0.002 + k \times 0.02,\ 0 \le k \le 24\}. \]


Note that the theoretical conditions on the bandwidths are asymptotic, so
they have no real meaning in our simulations with a fixed $n$. In practice,
this set must be rich enough to catch optimal bandwidths for a large class
of densities, but small enough to keep the computation time reasonable. For
our study, we choose equally spaced bandwidths for a good observation of the
choice of $\hat h$, and we also add the set $\{e^{-k},\ 3 \le k \le 10\}$ to
have very small bandwidths available, which are useful for irregular
densities.

For $n = 5000$ and $n = 50000$, and several values of $a$, Figure 2 plots

\[ C_0 = \tilde{\mathbb E}\left( \frac{\|\hat f_{\hat h} - f\|^2}{\min_{h \in \mathcal H} \|\hat f_h - f\|^2} \right) \]

where $\tilde{\mathbb E}$ means the empirical mean over $N = 50$
experiments. Thus the smaller $C_0$, the better the estimation. Moreover, we
also plot in Figure 3 the selected bandwidth compared to the optimal
bandwidth in the selection (for $N = 1$ experiment), i.e.

\[ \hat h - h_0 \quad \text{where} \quad \|\hat f_{h_0} - f\|^2 = \min_{h \in \mathcal H} \|\hat f_h - f\|^2. \]
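The two quantities reported in the figures can be computed along the following lines. This is only a sketch under stated assumptions: the Gaussian kernel, a Riemann sum on a user-supplied regular grid to approximate the $L^2$ loss, and hypothetical function names; it is not the code used for the simulations of the paper.

```python
import numpy as np

def kde_gauss(grid, x, h):
    """Gaussian kernel density estimate evaluated on `grid`."""
    u = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (x.size * h * np.sqrt(2.0 * np.pi))

def integrated_losses(x, bandwidths, true_pdf, grid):
    """Riemann-sum approximation of ||f_hat_h - f||^2 for each bandwidth h."""
    step = grid[1] - grid[0]
    f = true_pdf(grid)
    return np.array([np.sum((kde_gauss(grid, x, h) - f)**2) * step
                     for h in bandwidths])

def oracle_summary(losses, bandwidths, selected):
    """Loss ratio and bandwidth gap of index `selected` relative to the oracle h_0."""
    best = int(np.argmin(losses))
    return losses[selected] / losses[best], bandwidths[selected] - bandwidths[best]
```

Averaging the first output over independent samples gives an empirical version of $C_0$; the second output is the difference $\hat h - h_0$ for one sample. The grid step has to be small compared with the smallest bandwidth, otherwise the Riemann sum misses the spikes of the estimator.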


We can observe that the risk (and hence the oracle constant $C_0$) is very
high for small values of $a$, as expected. Then it jumps to a small value,
which indicates that the method begins to work well. For too large values of
$a$ the risk finally goes back up. Thus we observe in practice the
transition phenomenon that was announced by the theory. However, contrary to
the theoretical results, the critical value may not be exactly at $a = 1$,
especially for small values of $n$. As already mentioned above, this is
related to the asymptotic nature of the theoretical results that we have
obtained. For irregular densities (Examples 2, 5, 6), the optimal bandwidth
is very small, so it is consistent to observe a smaller jump for the
bandwidth choice. However the jump does exist and this is the interesting
point. We can also observe that the optimal value for $a$ seems to be very
close to the jump point. That may pose a problem of calibration and this is
what we would like to discuss now.
Figure 2: Oracle constant $C_0$ as a function of $a$, for Examples 1-6 (panels: $n = 5000$ and $n = 50000$)


5. Discussion 


To calibrate the penalty $V$, we face two practical problems: first, the
optimal value for $a$ seems to be extremely close to the minimal value;
secondly, this latter value is not necessarily equal to the (asymptotic)
theoretical value $a = 1$. In order to clearly separate the optimal value
from the minimal one, we propose to use a slightly different procedure,
which depends on two possibly different penalty parameters instead of one as
in the previous procedure:


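\[ B(h) = \sup_{h' \le h} \left[ \|\hat f_{h'} - \hat f_h\|^2 - a\, \frac{\|K_{h'}\|^2}{n} \right]_+ , \qquad \hat h = \arg\min_{h \in \mathcal H} \left\{ B(h) + b\, \frac{\|K_h\|^2}{n} \right\} \]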

with $b \neq a$. Of course this procedure is merely the one that we have
previously studied when $a = b$. Our belief is that taking $a$ and $b$ to be
different leads to a better and more stable calibration. A good track for
practical purposes seems to be to use the procedure of Section 2 to find the
value of $a$ where there is a jump in the risk (in practice this jump can be
detected on the selected bandwidths) and then to choose $b = 2a$. Once
again, what is important for practical calibration of the penalty is not
that the jump appears at $a = 1$ (this value should be considered as some
"asymptopia" which is never achieved) but that the jump does exist, so that
it becomes possible to use the calibration strategy that we just described.
Proving theoretical results for this procedure is another interesting issue
related to optimality considerations for the penalty that we do not intend
to address here.

Figure 3: $\hat h - h_0$ as a function of $a$, for Examples 1-6 (panels: $n = 5000$ and $n = 50000$)
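The calibration strategy described in this section (locate the jump of the selected bandwidth along a grid of values of $a$, then set $b = 2a$) can be sketched as follows. The text does not specify how the jump should be detected; taking the largest increase between two consecutive grid values is an assumption made here for illustration, and the function name is hypothetical.

```python
import numpy as np

def calibrate_from_jump(a_grid, selected_h):
    """Locate the largest upward jump of the selected bandwidth along a_grid.

    a_grid     : increasing values of the penalty constant a
    selected_h : bandwidth selected by the one-parameter procedure for each a
    Returns (a_jump, b) with b = 2 * a_jump.
    """
    a_grid = np.asarray(a_grid, dtype=float)
    selected_h = np.asarray(selected_h, dtype=float)
    jumps = np.diff(selected_h)        # h(a_{i+1}) - h(a_i)
    i = int(np.argmax(jumps))          # assumption: the jump is the largest increase
    a_jump = a_grid[i + 1]             # first value of a after the jump
    return a_jump, 2.0 * a_jump
```

On a finer grid of values of $a$ the location of the jump is resolved more precisely, at the price of running the selection procedure more often.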

6. Proofs 

6.1. Proof of Proposition 1

The first step is to write, for some fixed $h \in \mathcal H$,

\[ \|\hat f_{\hat h} - f\| \le \|\hat f_{\hat h} - \hat f_h\| + \|\hat f_h - f\|. \]

The last term can be split into
$\|\hat f_h - f_h\| + \|f_h - f\| \le \|\hat f_h - f_h\| + D(h)$.
Notice that for all $h' \le h$, using (2),
$\|\hat f_{h'} - \hat f_h\|^2 \le B(h) + V(h')$, which can be written, for
all $h, h'$,

\[ \|\hat f_{h'} - \hat f_h\|^2 \le B(h \vee h') + V(h \wedge h') \]

where $h \vee h' = \max(h, h')$ and $h \wedge h' = \min(h, h')$. Then, using (3),

\[ \|\hat f_{\hat h} - \hat f_h\|^2 \le B(h \vee \hat h) + V(h \wedge \hat h) \le B(h) + V(h) + \max(B(h), V(h)). \]

We obtain, for any $h \in \mathcal H$,

\[ \|\hat f_{\hat h} - f\| \le \sqrt{2B(h) + 2V(h)} + D(h) + \|\hat f_h - f_h\|. \]
Thus the heart of the proof is to control
$B(h) = \sup_{h' \le h} \left[ \|\hat f_{h'} - \hat f_h\|^2 - V(h') \right]_+$
by a bias term. First we center the variables and write

\[ \|\hat f_{h'} - \hat f_h\|^2 \le (1 + \varepsilon) \|\hat f_{h'} - f_{h'} - \hat f_h + f_h\|^2 + (1 + \varepsilon^{-1}) \|f_{h'} - f_h\|^2, \]

with $\varepsilon$ some positive real to be specified later. Moreover
$\|\hat f_{h'} - f_{h'} - \hat f_h + f_h\| = \sup_{t \in \mathcal B} \nu(t)$
where $\mathcal B$ is the unit ball in $L^2$ and

\[ \nu(t) = \langle t,\ \hat f_{h'} - f_{h'} - \hat f_h + f_h \rangle = \frac1n \sum_{i=1}^n g_t(X_i) - \mathbb E(g_t(X_i)) \]

with

\[ g_t(X) = \int (K_{h'} - K_h)(x - X)\, t(x)\, dx. \]

We shall now use the concentration inequality stated in Lemma 2 with
$\mathcal F$ a countable set in $\mathcal B$ such that
$\sup_{t \in \mathcal F} \nu(t) = \sup_{t \in \mathcal B} \nu(t)$ (this
equality is true for any dense subset of $\mathcal B$ for the $L^2$
topology, since $\nu$ is continuous). To apply result (4), we need to
compute $b$, $H$ and $v$.

• For all $y \in \mathbb R$, since $t \in \mathcal B$,

\[ |g_t(y)| = \left| \int (K_{h'} - K_h)(x - y)\, t(x)\, dx \right| \le \|K_{h'} - K_h\|\, \|t\| \le \|K_{h'} - K_h\| \le \|K_{h'}\| \]

so that $b = \|K_{h'}\|$. We used assumption (K0) which implies, for
$h' \le h$,
$\|K_{h'} - K_h\|^2 \le \|K_{h'}\|^2 - \|K_h\|^2 \le \|K_{h'}\|^2$.

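• Jensen's inequality gives
$H^2 \le \mathbb E(\sup_{t \in \mathcal F} \nu^2(t))$. Now

\[ \sup_{t \in \mathcal F} \nu^2(t) = \|\hat f_{h'} - f_{h'} - \hat f_h + f_h\|^2 = \left\| \frac1n \sum_{i=1}^n (K_{h'} - K_h)(x - X_i) - \mathbb E\big((K_{h'} - K_h)(x - X_i)\big) \right\|^2, \]

\[ \mathbb E\left( \sup_{t \in \mathcal F} \nu^2(t) \right) = \int \mathrm{Var}\left( \frac1n \sum_{i=1}^n (K_{h'} - K_h)(x - X_i) \right) dx = \frac1n \int \mathrm{Var}\big((K_{h'} - K_h)(x - X_1)\big)\, dx \tag{7} \]

\[ \le \frac1n \int \mathbb E\big((K_{h'} - K_h)^2(x - X_1)\big)\, dx \le \frac1n \|K_{h'} - K_h\|^2 \le \frac1n \|K_{h'}\|^2. \tag{8} \]

Then $H^2 \le n^{-1} \|K_{h'}\|^2$.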
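• For the variance term, let us write

\[ \mathrm{Var}(g_t(X_1)) \le \mathbb E\left[ \left( \int (K_{h'} - K_h)(x - X)\, t(x)\, dx \right)^2 \right] \le \mathbb E\left[ \int |K_{h'} - K_h|(x - X)\, dx \int |K_{h'} - K_h|(x - X)\, t^2(x)\, dx \right] \]

\[ \le \|K_{h'} - K_h\|_1^2\, \|f\|_\infty\, \|t\|^2 \le 4 \|K\|_1^2\, \|f\|_\infty \]

since $\|K_{h'} - K_h\|_1 \le 2\|K\|_1$. Then
$v = 4\|K\|_1^2 \|f\|_\infty = 4\|f\|_\infty$.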

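Finally, using (4), with probability larger than

\[ 1 - \sum_{h' \le h} \max\left( e^{-\frac{\varepsilon^2 \wedge \varepsilon}{24} \sqrt n},\ e^{-\frac{\varepsilon^2 \|K\|^2}{24 \|f\|_\infty} \frac{1}{h'}} \right), \]

for all $h' \le h$ in $\mathcal H$,

\[ \|\hat f_{h'} - f_{h'} - \hat f_h + f_h\| \le (1 + \varepsilon) \frac{\|K_{h'}\|}{\sqrt n}, \]

where we choose $\varepsilon$ such that $a \ge (1 + \varepsilon)^3$. Then,
with probability larger than

\[ 1 - \sum_{h \in \mathcal H} \sum_{h' \le h} \max\left( e^{-\frac{\varepsilon^2 \wedge \varepsilon}{24} \sqrt n},\ e^{-\frac{\varepsilon^2 \|K\|^2}{24 \|f\|_\infty} \frac{1}{h'}} \right), \]

for any $h$,

\[ B(h) \le (1 + \varepsilon^{-1}) D(h)^2. \]

In the same way, choosing $0 < \tilde\varepsilon \le \sqrt a - 1$, we can
prove that, with probability larger than

\[ 1 - \sum_{h \in \mathcal H} \max\left( e^{-\frac{\tilde\varepsilon^2 \wedge \tilde\varepsilon}{24} \sqrt n},\ e^{-\frac{\tilde\varepsilon^2 \|K\|^2}{6 \|f\|_\infty} \frac{1}{h}} \right), \]

for any $h$,

\[ \|\hat f_h - f_h\| \le (1 + \tilde\varepsilon) \frac{\|K_h\|}{\sqrt n} \le \sqrt{V(h)}. \]

Finally, with high probability,

\[ \|\hat f_{\hat h} - f\| \le \sqrt{2(1 + \varepsilon^{-1}) D(h)^2 + 2V(h)} + D(h) + \sqrt{V(h)} \le \left( \sqrt{2(1 + \varepsilon^{-1})} + 1 \right) \left( D(h) + \sqrt{V(h)} \right). \]

To conclude we choose $\varepsilon = \tilde\varepsilon = a^{1/3} - 1$.
Regarding the second result, note that the rough bound
$\|\hat f_h\|^2 \le \|K_h\|^2 \le \|K\|^2 / h_{\min}$ is valid for all $h$.
Then, denoting by $A$ the set on which the previous oracle inequality is
verified,

\[ \mathbb E\|\hat f_{\hat h} - f\|^2 \le \mathbb E\left( \|\hat f_{\hat h} - f\|^2 \mathbf 1_A \right) + 2\left( \|f\|^2 + \|K\|^2 / h_{\min} \right) \mathbb P(A^c) \]

with

\[ \mathbb P(A^c) \le 2 \sum_{h, h' \in \mathcal H} \max\left( e^{-c\sqrt n},\ e^{-c/h'} \right) \le 2 |\mathcal H|^2 e^{-c/h_{\max}}. \]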

6.2. Proof of Theorem 3

We shall prove a more general version of the theorem, where several
bandwidth sets $\mathcal H$ and kernels $K$ are possible. We denote
$\mathrm{Crit}(h) := B(h) + V(h)$ and
$E_{\mathcal H} = \min\{h/h' ;\ h \in \mathcal H,\ h' \in \mathcal H,\ h > h'\}$.
We assume that $E_{\mathcal H}$ does not depend on $n$ and is larger than 1
($\mathcal H = \{e^{-k},\ a_n \le k \le b_n\}$ suits with
$E_{\mathcal H} = e$). Let us define

\[ \varphi(x) = \|K\|^{-2} \|K - K_x\|^2 = 1 + \frac1x - 2\, \frac{\langle K, K(x\,\cdot)\rangle}{\|K\|^2}. \]

We assume that the kernel $K$ satisfies:

(K1) the function $\varphi$ is bounded from below over $[E_{\mathcal H}, +\infty)$,

(K2) for $0 < \eta < 1$, the function $\varphi(x) - \frac{\eta}{x}$ tends to
$+\infty$ when $x \to 0$ and is decreasing in some neighborhood of 0,

(K3) for $0 < \eta < 1$, the function $\varphi(x) + \frac{\eta}{x}$ is
increasing for $x \ge 2$.

These assumptions are mild, as shown in the following Lemma, proved in
Section 6.3.

Lemma 4. The following kernels satisfy assumptions (K0-K3):
a - Gaussian kernel: $K(x) = e^{-x^2/2} / \sqrt{2\pi}$
b - Rectangular kernel: $K(x) = \mathbf 1_{[-1,1]}(x) / 2$
c - Epanechnikov kernel: $K(x) = (3/4)(1 - x^2)\, \mathbf 1_{[-1,1]}(x)$
d - Biweight kernel: $K(x) = (15/16)(1 - x^2)^2\, \mathbf 1_{[-1,1]}(x)$

The general result is:
Theorem 5. Assume (K0-K3) and that $f$ is bounded. Assume that
$E_{\mathcal H}$ does not depend on $n$ and $h_{\max} \to 0$ when
$n \to \infty$. We also assume that there exist reals $\theta_1 < \theta_2$
such that $\theta_2 \ge 2$, $\theta_1 h_{\min} \in \mathcal H$ and
$\varphi(\theta_2) - \varphi(\theta_1) > 1/\theta_1 - 1/\theta_2$.

Then, if $a < 1$, there exists $C = C(\|f\|_\infty) > 0$ such that, for $n$
large enough (depending on $f$, $\mathcal H$, $K$),

\[ \mathbb P(\hat h \ge \theta_2 h_{\min}) \le \sum_{h \in \mathcal H} \sum_{h' < h} \max\left( e^{-C \varepsilon^2 \sqrt n},\ e^{-C \varepsilon^2 \|K_{h'} - K_h\|^2} \right) \]

where $\varepsilon < 1 - a^{1/3}$. If
$\mathcal H = \{e^{-k},\ a_n \le k \le b_n\}$ and the kernel is Gaussian,
rectangular, Epanechnikov or biweight, $\theta_1 = e$ and $\theta_2 = 3$ work.
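The condition on $(\theta_1, \theta_2)$ is easy to check numerically. The sketch below does so for the Gaussian kernel, assuming the closed form $\langle K, K(x\,\cdot)\rangle / \|K\|^2 = \sqrt 2 / \sqrt{1 + x^2}$ for that kernel; the function names are illustrative and not part of the paper.

```python
import math

def phi_gauss(x):
    """phi(x) = 1 + 1/x - 2 <K, K(x.)> / ||K||^2 for the Gaussian kernel."""
    return 1.0 + 1.0 / x - 2.0 * math.sqrt(2.0) / math.sqrt(1.0 + x * x)

def theta_margin(phi, theta1, theta2):
    """phi(theta2) - phi(theta1) - (1/theta1 - 1/theta2).

    The assumption of the theorem requires this margin to be positive.
    """
    return phi(theta2) - phi(theta1) - (1.0 / theta1 - 1.0 / theta2)
```

With `theta_margin(phi_gauss, math.e, 3.0)` the margin is positive (roughly 0.013), in line with the claim that $\theta_1 = e$ and $\theta_2 = 3$ work for the Gaussian kernel. The margin is small, which suggests that the admissible range for $\varepsilon$ in the proof is narrow.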

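This result implies Theorem 3 since, under (K1),
$\|K_{h'} - K_h\|^2 = \frac{\|K\|^2}{h'} \varphi(h/h') \ge \left( \min_{[E_{\mathcal H}, +\infty)} \varphi \right) \frac{\|K\|^2}{h'}$
as soon as $h > h'$, so that

\[ \sum_{h \in \mathcal H} \sum_{h' < h} e^{-C \|K_{h'} - K_h\|^2} \le |\mathcal H|^2 e^{-C'/h_{\max}}. \]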

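Let $\varepsilon \in (0,1)$ be such that $a < (1 - \varepsilon)^3$ and

\[ \varepsilon^3 + 3\varepsilon^2 + 3\varepsilon < \frac{\varphi(\theta_2) - \varphi(\theta_1) - a/\theta_1 + a/\theta_2}{\varphi(\theta_1) + \varphi(\theta_2)} \tag{9} \]

(possible since
$a < 1 < (\varphi(\theta_2) - \varphi(\theta_1)) / (1/\theta_1 - 1/\theta_2)$).
Let us decompose

\[ \hat f_{h'} - \hat f_h = (\hat f_{h'} - f_{h'} - \hat f_h + f_h) + (f_{h'} - f_h) = S(h, h') + (f_{h'} - f_h) \]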

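with

\[ S(h, h') = \frac1n \sum_{i=1}^n (K_{h'} - K_h)(x - X_i) - \mathbb E\big((K_{h'} - K_h)(x - X_i)\big) \]

and the bias term
$\|f_{h'} - f_h\| \le \sup_{h' \le h} \|K_{h'} * f - K_h * f\| \le D(h)$.
First write

\[ (1 - \varepsilon) \|S(h, h')\|^2 - \left( \frac{1}{\varepsilon} - 1 \right) D(h)^2 \le \|\hat f_{h'} - \hat f_h\|^2 \le (1 + \varepsilon) \|S(h, h')\|^2 + \left( 1 + \frac{1}{\varepsilon} \right) D(h)^2. \]

Now we shall prove that with high probability

\[ (1 - \varepsilon) \frac{\|K_{h'} - K_h\|}{\sqrt n} \le \|S(h, h')\| \le (1 + \varepsilon) \frac{\|K_{h'} - K_h\|}{\sqrt n}. \]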
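First, we can prove as in Section 6.1 that for all $h' < h$

\[ \mathbb P\left( \|S(h, h')\| \ge (1 + \varepsilon) \frac{\|K_{h'} - K_h\|}{\sqrt n} \right) \le \max\left( \exp\left( -\frac{\varepsilon^2 \wedge \varepsilon}{24} \sqrt n \right),\ \exp\left( -\frac{\varepsilon^2}{24 \|f\|_\infty} \|K_{h'} - K_h\|^2 \right) \right). \]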

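Next, we shall use (5) in Lemma 2 in order to lower bound $\|S(h, h')\|$.
Recall that $\|S(h, h')\| = \sup_{t \in \mathcal B} \nu(t)$ where
$\mathcal B$ is the unit ball in $L^2$ and
$\nu(t) = \frac1n \sum_{i=1}^n g_t(X_i) - \mathbb E(g_t(X_i))$ with
$g_t(X) = \int (K_{h'} - K_h)(x - X)\, t(x)\, dx$. With the notation of
Lemma 2 we have $b = \|K_{h'} - K_h\|$, $H'^2 = n^{-1} \|K_{h'} - K_h\|^2$
and $v = 4\|K\|_1^2 \|f\|_\infty$. It remains to lower bound $H$. First,
remark that (7) provides
$n\, \mathbb E(\sup_{t \in \mathcal B} \nu^2(t)) = \|K_{h'} - K_h\|^2 - \|(K_{h'} - K_h) * f\|^2$.
Next, using (6),

\[ \mathbb E\left( \sup_{t \in \mathcal B} \nu^2(t) \right) \le \frac vn + 4\, \frac{bH}{n} + H^2 \le \frac vn + \left( H + \frac{2b}{n} \right)^2. \]

Then

\[ n \left( H + \frac{2b}{n} \right)^2 \ge n\, \mathbb E\left( \sup_{t \in \mathcal B} \nu^2(t) \right) - v = \|K_{h'} - K_h\|^2 - \|(K_{h'} - K_h) * f\|^2 - 4\|K\|_1^2 \|f\|_\infty \]

which implies

\[ \sqrt n \left( H + \frac{2b}{n} \right) \ge \sqrt{\|K_{h'} - K_h\|^2 - 4\|K\|_1^2 (\|f\|_\infty + \|f\|^2)}. \]

Since $b = \|K_{h'} - K_h\|$,

\[ H \ge \frac{\sqrt{\|K_{h'} - K_h\|^2 - 4\|K\|_1^2 (\|f\|_\infty + \|f\|^2)}}{\sqrt n} - \frac{2 \|K_{h'} - K_h\|}{n}. \]

Now, for $h' < h$,

\[ H \ge \frac{\|K_{h'} - K_h\|}{\sqrt n} \left( \sqrt{1 - \frac{4\|K\|_1^2 (\|f\|_\infty + \|f\|^2)}{\|K_{h'} - K_h\|^2}} - \frac{2}{\sqrt n} \right) \]

so

\[ H - \frac{\varepsilon}{3} H' \ge H' \left( \sqrt{1 - \frac{4\|K\|_1^2 (\|f\|_\infty + \|f\|^2)}{\|K_{h'} - K_h\|^2}} - \frac{2}{\sqrt n} - \frac{\varepsilon}{3} \right). \]

From (K1),
$\|K_{h'} - K_h\|^2 = \frac{\|K\|^2}{h'} \varphi(h/h') \ge \left( \min_{[E_{\mathcal H}, +\infty)} \varphi \right) \frac{\|K\|^2}{h_{\max}} \to \infty$
and, in consequence, for $n$ large enough

\[ H - \frac{\varepsilon}{3} H' \ge H' (1 - \varepsilon). \]

Thus for $n$ large enough

\[ \mathbb P\left( \|S(h, h')\| \le (1 - \varepsilon) \frac{\|K_{h'} - K_h\|}{\sqrt n} \right) \le \max\left( \exp\left( -\frac{\varepsilon^2 \wedge (3\varepsilon)}{24 \times 9} \sqrt n \right),\ \exp\left( -\frac{\varepsilon^2}{24 \times 9\, \|f\|_\infty} \|K_{h'} - K_h\|^2 \right) \right). \tag{10} \]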

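Let $\delta(h, h) = 0$ and, if $h \neq h'$,

\[ \delta(h, h') = 2 \max\left( \exp\left( -\frac{\varepsilon^2 \wedge (3\varepsilon)}{24 \times 9} \sqrt n \right),\ \exp\left( -\frac{\varepsilon^2}{24 \times 9\, \|f\|_\infty} \|K_{h'} - K_h\|^2 \right) \right). \]

We just proved that for $n$ large enough, with probability larger than
$1 - \delta(h, h')$,

\[ (1 - \varepsilon)^2 \frac{\|K_{h'} - K_h\|^2}{n} \le \|S(h, h')\|^2 \le (1 + \varepsilon)^2 \frac{\|K_{h'} - K_h\|^2}{n}. \]

Next, with probability larger than $1 - \sum_{h' \le h} \delta(h, h')$,

\[ B(h) \ge \sup_{h' \le h} \left[ (1 - \varepsilon)^3 \frac{\|K_{h'} - K_h\|^2}{n} - a\, \frac{\|K_{h'}\|^2}{n} \right]_+ - \left( \frac{1}{\varepsilon} - 1 \right) D(h)^2, \]

\[ B(h) \le \sup_{h' \le h} \left[ (1 + \varepsilon)^3 \frac{\|K_{h'} - K_h\|^2}{n} - a\, \frac{\|K_{h'}\|^2}{n} \right]_+ + \left( 1 + \frac{1}{\varepsilon} \right) D(h)^2. \]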

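But, if $h_{\min}$ is small enough, for $A > a$,

\[ \sup_{h' \le h} \left[ A\, \frac{\|K_{h'} - K_h\|^2}{n} - a\, \frac{\|K_{h'}\|^2}{n} \right]_+ = \left[ A\, \frac{\|K_{h_{\min}} - K_h\|^2}{n} - a\, \frac{\|K_{h_{\min}}\|^2}{n} \right]_+ . \]

Indeed, for $x = h'/h \le 1$,

\[ A\, \frac{\|K_{h'} - K_h\|^2}{n} - a\, \frac{\|K_{h'}\|^2}{n} = \frac{A \|K\|^2}{nh} \left( 1 + \frac{1 - a/A}{x} - 2\, \frac{\langle K, K(x\,\cdot)\rangle}{\|K\|^2} \right) = \frac{A \|K\|^2}{nh} \left( \varphi(x) - \frac{a/A}{x} \right) \]

and the function $\varphi(x) - \frac{a/A}{x}$ tends to $+\infty$ when
$x \to 0$ and is decreasing in some neighborhood of 0 (assumption (K2)).
Then with probability larger than
$1 - \sum_{h \in \mathcal H} \sum_{h' \le h} \delta(h, h')$, for all $h$,

\[ \mathrm{Crit}(h) \ge \frac{\|K\|^2}{n h_{\min}} \left[ -a + (1 - \varepsilon)^3 \varphi(h/h_{\min}) + \frac{a}{h/h_{\min}} \right] - \left( \frac{1}{\varepsilon} - 1 \right) D(h)^2, \]

\[ \mathrm{Crit}(h) \le \frac{\|K\|^2}{n h_{\min}} \left[ -a + (1 + \varepsilon)^3 \varphi(h/h_{\min}) + \frac{a}{h/h_{\min}} \right] + \left( 1 + \frac{1}{\varepsilon} \right) D(h)^2. \]

In particular, for $h = \theta_1 h_{\min}$,

\[ \mathrm{Crit}(\theta_1 h_{\min}) \le \frac{\|K\|^2}{n h_{\min}} \left[ -a + (1 + \varepsilon)^3 \varphi(\theta_1) + \frac{a}{\theta_1} \right] + \left( 1 + \frac{1}{\varepsilon} \right) \sup_h D(h)^2. \tag{11} \]

Moreover, since $a < (1 - \varepsilon)^3$, the function
$(1 - \varepsilon)^3 \varphi(x) + \frac ax$ is increasing for $x \ge 2$
(assumption (K3)). This implies that

\[ \forall h \ge \theta_2 h_{\min}, \qquad \mathrm{Crit}(h) \ge \frac{\|K\|^2}{n h_{\min}} \left[ -a + (1 - \varepsilon)^3 \varphi(\theta_2) + \frac{a}{\theta_2} \right] - \left( \frac{1}{\varepsilon} - 1 \right) \sup_h D(h)^2. \tag{12} \]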

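Since $(K_h)$ is an approximation to the identity, $\|f - K_h * f\|$ tends
to 0 when $h$ tends to 0. This implies that
$D(h) \le 2 \sup_{h' \le h} \|f - K_{h'} * f\|$ tends to 0 and
$\sup_{h \in \mathcal H} D(h)$ tends to 0, as soon as $h_{\max}$ tends to 0.
Now (9) leads to
$A := (1 - \varepsilon)^3 \varphi(\theta_2) + \frac{a}{\theta_2} - (1 + \varepsilon)^3 \varphi(\theta_1) - \frac{a}{\theta_1} > 0$.
Then, for $n$ large enough,
$(2/\varepsilon) \sup_h D(h)^2 < \frac{\|K\|^2}{n h_{\min}} A$ so that

\[ \frac{\|K\|^2}{n h_{\min}} \left[ -a + (1 + \varepsilon)^3 \varphi(\theta_1) + \frac{a}{\theta_1} \right] + \left( 1 + \frac{1}{\varepsilon} \right) \sup_h D(h)^2 < \frac{\|K\|^2}{n h_{\min}} \left[ -a + (1 - \varepsilon)^3 \varphi(\theta_2) + \frac{a}{\theta_2} \right] - \left( \frac{1}{\varepsilon} - 1 \right) \sup_h D(h)^2. \tag{13} \]

Finally, combining (11), (12) and (13) gives $\hat h < \theta_2 h_{\min}$
with probability larger than
$1 - \sum_{h \in \mathcal H} \sum_{h' \le h} \delta(h, h')$.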

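Let us now prove the second part of Theorem 3, that is the lower bound
on the risk. Let $A_n = \{\hat h < 3 h_{\min}\}$ and
$B_n = \bigcap_{h \in \mathcal H} \left\{ \|\hat f_h - f_h\| \ge \frac{\|K_h\|}{2\sqrt n} \right\}$.
We have just proved that
$\mathbb P(A_n^c) \le C (\log n)^2 \exp(-(\log n)^2 / C)$. In the same way
as for (10), we can write for $n$ large enough

\[ \mathbb P\left( \|\hat f_h - f_h\| \le (1 - \varepsilon) \frac{\|K_h\|}{\sqrt n} \right) \le \max\left( \exp\left( -\frac{\varepsilon^2 \wedge (3\varepsilon)}{24 \times 9} \sqrt n \right),\ \exp\left( -\frac{\varepsilon^2}{6 \times 9\, \|f\|_\infty} \|K_h\|^2 \right) \right) \]

which implies $\mathbb P(B_n^c) \le C' (\log n) \exp(-(\log n)^2 / C')$ and
then

\[ \mathbb P(A_n \cap B_n) \ge 1 - o(1). \]

Then we can write

\[ \|f - \hat f_{\hat h}\| \ge \|\hat f_{\hat h} - f_{\hat h}\|\, \mathbf 1_{A_n \cap B_n} - \|f - f_{\hat h}\| \]
\[ \ge \min_{h < 3 h_{\min}} \|\hat f_h - f_h\|\, \mathbf 1_{A_n \cap B_n} - \max_h \|f - f_h\| \]
\[ \ge \min_{h < 3 h_{\min}} \frac{\|K_h\|}{2 \sqrt n}\, \mathbf 1_{A_n \cap B_n} - \max_h \|f - f_h\| \]
\[ \ge \frac{\|K\|}{2\sqrt 3\, \sqrt{n h_{\min}}}\, \mathbf 1_{A_n \cap B_n} - \max_h \|f - f_h\|. \]

But $\max_h \|f - f_h\| \to 0$ (since $h_{\max} \to 0$), and
$n h_{\min} \to 1$ when $n \to \infty$. Hence

\[ \liminf_{n \to \infty} \mathbb E\|f - \hat f_{\hat h}\| \ge \frac{\|K\|}{2\sqrt 3}, \]

which proves that $\mathbb E\|f - \hat f_{\hat h}\|$ is bounded away from 0
for $n$ large enough.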


6.3. Proof of Lemma 4

To prove Lemma 4 it is sufficient to do computations on integrals. We
obtain:

a - if $K$ is the Gaussian kernel,

\[ \frac{\langle K, K(x\,\cdot)\rangle}{\|K\|^2} = \frac{\sqrt 2}{\sqrt{1 + x^2}}, \]

b - if $K$ is the rectangular kernel,

\[ \frac{\langle K, K(x\,\cdot)\rangle}{\|K\|^2} = \frac1x \wedge 1, \]

c - if $K$ is the Epanechnikov kernel, for $0 < x \le 1$,

\[ \frac{\langle K, K(x\,\cdot)\rangle}{\|K\|^2} = \frac54 \left( 1 - \frac{x^2}{5} \right), \]

d - if $K$ is the biweight kernel:

\[ \frac{\langle K, K(x\,\cdot)\rangle}{\|K\|^2} = \ldots \]

These formulas make it possible to verify all the assumptions.
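The ratios $\langle K, K(x\,\cdot)\rangle / \|K\|^2$ can also be evaluated numerically for any of the four kernels, which gives an independent check of closed forms such as those for the Gaussian and rectangular kernels. The sketch below uses a plain Riemann sum on a regular grid; the grid width, the number of points and the function names are illustrative assumptions.

```python
import numpy as np

KERNELS = {
    "gaussian": lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi),
    "rectangular": lambda u: 0.5 * (np.abs(u) <= 1.0),
    "epanechnikov": lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None),
    "biweight": lambda u: (15.0 / 16.0) * np.clip(1.0 - u**2, 0.0, None)**2,
}

def ratio(kernel, x, half_width=12.0, m=240001):
    """Riemann-sum approximation of <K, K(x.)> / ||K||^2.

    The grid step cancels in the ratio, so only the two sums are needed.
    """
    u = np.linspace(-half_width, half_width, m)
    k = kernel(u)
    return np.sum(k * kernel(x * u)) / np.sum(k * k)
```

For example, `ratio(KERNELS["gaussian"], 0.5)` agrees with $\sqrt 2 / \sqrt{1.25}$ to several digits, and for each of the four kernels the ratio stays at least 1 on $(0, 1]$, which is the content of assumption (K0).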